Abstract



This investigation of the comparability of writing assessment prompts was conducted in two 
phases. In an exploratory Phase I, 47 writing prompts administered in the computer-based Test of 
English as a Foreign Language™ (TOEFL® CBT) from July through December 1998 were 
examined. Logistic regression procedures were used to estimate prompt difficulty and gender 
effects. A panel of experts reviewed selected prompts, and a taxonomy of prompt characteristics 
was developed and related to prompt difficulty and gender differences. In Phase II, 87 prompts 
administered from July 1998 through March 2000 were analyzed. All of the prompts used in 
Phase I, together with 40 new prompts, were analyzed using the larger Phase II database. 
Recommendations are made for statistical quality control procedures to identify less comparable 
prompts. 

Key words: computer-based writing assessment, essay prompts, comparability, fairness, 
polytomous DIF (differential item functioning), gender, logistic regression, proportional
odds-ratio model
The Test of English as a Foreign Language™ (TOEFL®) was developed in 1963 by the National
Council on the Testing of English as a Foreign Language. The Council was formed through the 
cooperative effort of more than 30 public and private organizations concerned with testing the English 
proficiency of nonnative speakers of the language applying for admission to institutions in the United 
States. In 1965, Educational Testing Service® (ETS®) and the College Board® assumed
joint responsibility for the program. In 1973, a cooperative arrangement for the operation of the 
program was entered into by ETS, the College Board, and the Graduate Record Examinations (GRE®)
Board. The membership of the College Board is composed of schools, colleges, school systems, and 
educational associations; GRE Board members are associated with graduate education. 

ETS administers the TOEFL program under the general direction of a policy board that was 
established by, and is affiliated with, the sponsoring organizations. Members of the TOEFL Board 
(previously the Policy Council) represent the College Board, the GRE Board, and such institutions and 
agencies as graduate schools of business, junior and community colleges, nonprofit educational 
exchange agencies, and agencies of the United States government. 



A continuing program of research related to the TOEFL test is carried out under the direction of the 
TOEFL Committee of Examiners. Its 12 members include representatives of the TOEFL Board and 
distinguished English as a second language specialists from the academic community. The Committee 
meets twice yearly to review and approve proposals for test-related research and to set 
guidelines for the entire scope of the TOEFL research program. Members of the Committee of 
Examiners serve four-year terms at the invitation of the Board; the chair of the committee serves on 
the Board. 



Because the studies are specific to the TOEFL test and the testing program, most of the actual research 
is conducted by ETS staff rather than by outside researchers. Many projects require the cooperation 
of other institutions, however, particularly those with programs in the teaching of English as a foreign 
or second language and applied linguistics. Representatives of such programs who are interested in 
participating in or conducting TOEFL-related research are invited to contact the TOEFL program 
office. All TOEFL research projects must undergo appropriate ETS review to ascertain that data 
confidentiality will be protected. 
Introduction



The focus of the present investigation was on the comparability of the prompts used in 
writing skill assessment. It is important to examine the comparability of prompts in the 
computer-based Test of English as a Foreign Language™ (TOEFL® CBT) because each 
examinee receives only a single prompt, and this prompt is generally not the same for all 
examinees. If the prompts are not comparable in difficulty, those examinees receiving the most 
difficult prompts would be disadvantaged and those receiving the least difficult prompts would 
be advantaged. To our knowledge, no statistical method exists to control for difficulty when only 
one item or prompt is administered to each examinee, and yet it is important in testing programs 
to ensure that all examinees are administered tests of equivalent difficulty. For these reasons, 
Stansfield and Ross (1988), in their long-term research agenda for TOEFL writing assessment, 
stated that the highest priority should be given to the issue of comparability of scores obtained 
for different writing prompts. The comparability of prompts for different groups of examinees, 
such as gender groups, is also of importance. 

Gender differences on free-response writing examinations have tended to favor females, 
but the magnitude of gender differences varies across populations of examinees. For example, 
the National Assessment of Educational Progress (1994) has reported gender differences in 
performance on essay tests for national random samples exceeding one-half of a standard
deviation in grades 8 and 12. Similar results have been reported for some statewide examinations 
at grade 8 (Englehard, Gordon, & Gabrielson, 1991). In college-bound populations, gender 
differences in performance on essay tests of writing skill have been much smaller, ranging from 
a little over one-tenth to about one-third of a standard deviation, but still favoring females 
(Breland & Griswold, 1982; Breland & Jones, 1982; Bridgeman & Bonner, 1994). In graduate 
school applicant populations, females have averaged about one-tenth of a standard deviation 
higher on essay tests than males (Bridgeman & McHale, 1996; Schaeffer, Briel, & Fowles,
2001).
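
The gender differences cited above are expressed in standard deviation units. As a rough illustration of how such a figure can be computed, the following sketch divides the difference in group means by a pooled standard deviation. The function name and the scores are hypothetical and invented for illustration; the studies cited may have standardized their differences in other ways.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(group_a, group_b):
    """Return (mean of A - mean of B) divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical essay scores on a 1-6 scale (not data from any cited study).
female_scores = [4.0, 4.5, 3.5, 5.0, 4.0, 3.0]
male_scores = [4.0, 4.0, 3.5, 4.5, 4.0, 3.0]
print(round(standardized_mean_difference(female_scores, male_scores), 2))
```

A positive value indicates that the first group scored higher on average. A value near 0.1 would correspond to the "one-tenth of a standard deviation" differences described above.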

Females also tend to score slightly higher than males on writing tests in populations for 
whom English is a second language. In random samples of Test of English as a Foreign 
Language (TOEFL) examinees, female scores on an essay test averaged about one-tenth of a 
standard deviation higher than those for males (Golub-Smith, Reese, & Steinhaus, 1993). Two 
additional studies of ESL students have yielded higher scores for ESL females at the elementary 
school level. Bermudez and Prater (1994) found that essays written by Hispanic females showed
a greater degree of elaboration and received higher holistic scores. Heck and Crislip (2001) 
studied third grade students in Hawaii and found that females scored higher on both direct and 
indirect measures of writing skill than did males. 

It is not uncommon in psychology, and education more generally, to observe that females 
perform better on free-response tasks. In a paper on verbal fluency differences in school-age 
children, Sincoff and Sternberg (1988) reported that girls, especially those above age 11, scored 
higher than boys on verbal fluency tasks. In a summary paper, Sincoff and Sternberg (1987) 
discuss two types of verbal ability: verbal fluency and verbal comprehension. Verbal fluency is 
needed primarily for writing and speaking tasks, while verbal comprehension is needed primarily 
for reading and listening tasks. Berninger and Fuller (1992) studied written compositions of first, 
second, and third grade students and found that boys were at greater risk for writing disabilities. 

In the Golub-Smith et al. study, the comparability of prompts used for the Test of Written 
English™ (TWE®) was examined. Eight different prompts were spiraled (that is, administered
in a random or near-random manner) worldwide at the October 1989 TOEFL administration, with
each prompt eliciting approximately 10,000 essays. The results of the analyses conducted 
indicated small differences in mean scores obtained from some of the prompts, but the 
investigators had difficulty making definitive statements regarding the meaningfulness of the 
observed differences. While many of the observed differences in means were so small as to be of 
no practical significance, differences observed across prompts in the number of examinees at 
each score level were not. The study suggested that these score distribution differences may 
warrant further investigation. 

Other testing programs have also conducted studies of prompt differences. Pomplun, 
Wright, Oleka, and Sudlow (1992) studied prompts used for the College Board’s English 
Composition Test (ECT) with Essay. This study used ECT data for seven prompts administered 
during the years 1983 to 1990. Differential difficulty was explored through linear regressions of 
essay scores on objective scores for different sex, language, and ethnic groups. The results of 
these analyses indicated that, generally, the regressions were consistent across years but that two 
of the seven prompts studied contained characteristics that may have been related to differential 
performance. In one of the two identified prompts, the topic of heroes and values may have 
favored groups more familiar with cultural values. In the other prompt identified, the 
combination of an abstract topic with an ironic tone may have caused differential performance
for those with lower language skills. Further study was recommended of the nature of essay 
performance of minority and ESL groups. 
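
The general idea of regressing essay scores on objective scores separately for each group can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used by Pomplun et al.: it fits an ordinary least squares line for each group and compares the predicted essay score at a common objective score. All function names and data are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def predicted_essay_by_group(objective, essay, group, at_score):
    """Fit one regression line per group and return each group's predicted
    essay score at the same objective score (at_score)."""
    by_group = {}
    for x, y, g in zip(objective, essay, group):
        xs, ys = by_group.setdefault(g, ([], []))
        xs.append(x)
        ys.append(y)
    predictions = {}
    for g, (xs, ys) in by_group.items():
        intercept, slope = fit_line(xs, ys)
        predictions[g] = intercept + slope * at_score
    return predictions


# Hypothetical data: objective scores, essay scores, and group labels.
objective = [40, 50, 60, 40, 50, 60]
essay = [3.0, 4.0, 5.0, 3.5, 4.5, 5.5]
group = ["A", "A", "A", "B", "B", "B"]
print(predicted_essay_by_group(objective, essay, group, at_score=50))
```

If examinees with the same objective score are predicted to earn noticeably different essay scores depending on group, the prompt may be functioning differently for those groups. A real analysis would also need to consider sampling error and the reliability of both measures.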

Although their primary objective was not the study of prompt comparability, other 
studies have yielded results that are informative concerning the differential difficulty of prompts 
or the testing of foreign-language populations more generally. Mazzeo, Schmitt, and Bleistein 
(1993) compared the performance of gender and ethnic groups on the essay and multiple-choice 
components of Advanced Placement examinations. The results suggested that topic variability 
may have a greater effect than the variability associated with particular question types or broadly 
defined content areas. Questions based on passages related to topics such as patriotism, space 
satellites, and the ruggedness of the American prairie produced the largest group differences, 
which favored males. 

In a comprehensive review of measurement issues related to gender, Willingham and 
Cole (1997) noted that the specific topic of an essay assessment may affect the performance of 
different genders. For example, on the Advanced Placement English Language and Composition 
examination, some topics seemed to favor males, while others favored females. White women 
performed better than white men on a question that required an evaluation of an assertion about 
human nature. White men performed better than white women on a topic that asked them to 
compare the styles of passages written by Native Americans about the harshness of the American 
prairie (p. 191). 

While these previous studies have contributed to an understanding of problems in the 
assessment of English language writing, they have all been limited by the availability of prompts 
and by sampling restrictions. Prior to the introduction of computer-based testing, a single prompt 
was often associated with a single test administration date for TOEFL. There were thus 
unavoidable confoundings of prompts and samples, which made the comparison of prompts 
difficult at best. With the new TOEFL CBT administrations, numerous prompts are administered 
in a random (or near random) fashion to widely varying populations. These new CBT 
administrations thus offer an opportunity to examine prompt comparability with a rigor that has 
heretofore been impossible. 

Analyses of this type could have important implications for test development. If 
distinctive patterns are observed for different prompts, these patterns could guide prompt 